Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move dataclass to dedicated module #3195

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

LecrisUT
Copy link
Contributor

@LecrisUT LecrisUT commented Sep 5, 2024

A few reasons for this move:

A few things to point out:

  • All dataclass and field usage are now unified and populate the metadata. In principle this should be fine since it would just populate a field that would never be used, but should we do something else there to distinguish between DataContainer type of dataclasses and plain dataclass like the ones in hardware.py?
  • The import syntax is a bit disjoint. The naming of dataclasses and the from import is done like this in order to minimize the diffs for the purpose of initial review, but I think we should change it to be more uniform w.r.t. full name tmt.dataclasses.dataclass vs short name field

Pull Request Checklist

  • implement the feature
  • write the documentation

@LecrisUT
Copy link
Contributor Author

LecrisUT commented Sep 5, 2024

What do you think about putting these into a package tmt.containers and moving the remaining implementations there as well.

Copy link
Collaborator

@martinhoyer martinhoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My word should have very little weight here, but fwiw I like it. To me it makes things a bit more 'digestable'.

I wonder if we shouldn't come up with a master plan for what do we want in what file, subpackage, etc, so there is a clear logic and goal to stive for.

As for attrs, can't comment at the moment, but am educating myself about attrs vs dataclasses usage.

@happz
Copy link
Collaborator

happz commented Sep 5, 2024

What do you think about putting these into a package tmt.containers...

+1 to tmt.containers. I have it somewhere in my stash, I tried something similar: move basic classes and functions and helpers like field() or container_* out of tmt.utils. It'd be aligned with the goal of separating code from the implementation detail like the dataclasses (or attrs, when the time comes).

... and moving the remaining implementations there as well.

Which implementations? If you mean SerializableContainer and other "data container" classes, then yep, I'd like to do that as well. My idea was to have all these helpers and classes in their own module, both to simplify tmt.utils and keep the rest of the code clean.

@LecrisUT LecrisUT marked this pull request as draft September 6, 2024 09:04
@LecrisUT
Copy link
Contributor Author

LecrisUT commented Sep 6, 2024

Cool, I've moved the stuff around, I think after this the only things left to move would be Common and the Exceptions? Otherwise everything else would still be under utils but split into more manageable files.

I think it would also be good to start using __future__.annotations and .pyi files, it's so easy to hit circular dependencies on these.

Next step for this would be to detangle the some of these dataclasses and revert some of the redirects. In some cases it is obvious like tmt.utils.CommandOutput should be vanilla dataclass, and everything inheriting container should be the redirected, but then there are some grey ones like tmt.hardware.ConstraintComponents which are a bit in between those. Any thoughts on renaming dataclass to something distinct from vanilla dataclass/attrs.define? What about tmt_data/fmf_data, and maybe similarly tmt_field/fmf_field?

@martinhoyer
Copy link
Collaborator

I think it would also be good to start using future.annotations

+1

@happz
Copy link
Collaborator

happz commented Sep 6, 2024

Cool, I've moved the stuff around, I think after this the only things left to move would be Common and the Exceptions? Otherwise everything else would still be under utils but split into more manageable files.

Exceptions were on my list for sure, something like tmt.exceptions (or _exceptions). I was fine with Common staying where it was, maybe being moved into some tmt.utils submodule.

I think it would also be good to start using __future__.annotations and .pyi files, it's so easy to hit circular dependencies on these.

__future__.annotations sounds good, I'm much less fond of stub files, it always felt detached from the code & two places that need to be updated when changing signatures.

Next step for this would be to detangle the some of these dataclasses and revert some of the redirects. In some cases it is obvious like tmt.utils.CommandOutput should be vanilla dataclass

It already is, isn't it?

and everything inheriting container should be the redirected, but then there are some grey ones like tmt.hardware.ConstraintComponents which are a bit in between those. Any thoughts on renaming dataclass to something distinct from vanilla dataclass/attrs.define? What about tmt_data/fmf_data, and maybe similarly tmt_field/fmf_field?

Hmm, none of the options looks tempting, sorry. Why do you want to rename them? That might help us spend a couple of productive weeks, selecting the right name :)

@LecrisUT
Copy link
Contributor Author

LecrisUT commented Sep 6, 2024

__future__.annotations sounds good, I'm much less fond of stub files, it always felt detached from the code & two places that need to be updated when changing signatures.

I was thinking stub files as additive to keep things cleaner from parts like @overload, but indeed I can see it can easily get out-of-sync. There is one situation where it might be necessary, and that is for @deprecated because we cannot decorate an attribute.

Next step for this would be to detangle the some of these dataclasses and revert some of the redirects. In some cases it is obvious like tmt.utils.CommandOutput should be vanilla dataclass

It already is, isn't it?

and everything inheriting container should be the redirected, but then there are some grey ones like tmt.hardware.ConstraintComponents which are a bit in between those. Any thoughts on renaming dataclass to something distinct from vanilla dataclass/attrs.define? What about tmt_data/fmf_data, and maybe similarly tmt_field/fmf_field?

Hmm, none of the options looks tempting, sorry. Why do you want to rename them? That might help us spend a couple of productive weeks, selecting the right name :)

The idea here is that we can do a few more optimizations if we know that the dataclass is associated with a container/fmf object:

  • unify the attribute docstring and click help, which can help quite a bit for navigating the code with the IDEs
  • unify the fmf spec documentation with the python code (or if we are bold we could even generate/append the python docstring from the fmf files, but not sure if IDEs would render it)
  • generate json schema from the annotations
  • generally most other features could be handled by inheritance and __init_subclass__, but maybe some parts are easier done by overriding the dataclass instead, annotations stuff being one of them that I know of

But another main reason is for plugin developers to guarantee that any improvements made on the base tmt containers do not require any refactoring on their side.

@happz
Copy link
Collaborator

happz commented Sep 9, 2024

__future__.annotations sounds good, I'm much less fond of stub files, it always felt detached from the code & two places that need to be updated when changing signatures.

I was thinking stub files as additive to keep things cleaner from parts like @overload, but indeed I can see it can easily get out-of-sync. There is one situation where it might be necessary, and that is for @deprecated because we cannot decorate an attribute.

Hm, I could live with the current approach, something like field(..., deprecated=...).

Next step for this would be to detangle the some of these dataclasses and revert some of the redirects. In some cases it is obvious like tmt.utils.CommandOutput should be vanilla dataclass

It already is, isn't it?

and everything inheriting container should be the redirected, but then there are some grey ones like tmt.hardware.ConstraintComponents which are a bit in between those. Any thoughts on renaming dataclass to something distinct from vanilla dataclass/attrs.define? What about tmt_data/fmf_data, and maybe similarly tmt_field/fmf_field?

Hmm, none of the options looks tempting, sorry. Why do you want to rename them? That might help us spend a couple of productive weeks, selecting the right name :)

The idea here is that we can do a few more optimizations if we know that the dataclass is associated with a container/fmf object:

  • unify the attribute docstring and click help, which can help quite a bit for navigating the code with the IDEs
  • unify the fmf spec documentation with the python code (or if we are bold we could even generate/append the python docstring from the fmf files, but not sure if IDEs would render it)

WDYT about code being the source here? My idea was to make plugin code the single source of truth, and generate plugin documentation, schema, and everything from the code of the plugin. The same would apply to spec like test/plan/story fields.

  • generate json schema from the annotations

Yep, that's one of the goals :)

  • generally most other features could be handled by inheritance and __init_subclass__, but maybe some parts are easier done by overriding the dataclass instead, annotations stuff being one of them that I know of

But another main reason is for plugin developers to guarantee that any improvements made on the base tmt containers do not require any refactoring on their side.

I think we are aligned, at least in general, and what remains are the technicalities. For example, this move of "base container classes" into other own location, absolutely. What I'm less sure is whether we speak the same language when it comes to names, e.g.

Any thoughts on renaming dataclass to something distinct from vanilla dataclass/attrs.define? What about tmt_data/fmf_data, and maybe similarly tmt_field/fmf_field?

If you mean the original dataclasses.dataclass decorator, and you think we should use it in tmt code instead of "raw" @dataclasses.dataclass, then yes, I think that makes sense - for me, the benefit would be we could change the implementation of tmt.container.whateverdecorator without polluting diff with oneliners changing what decorator is used in all plugins and base. So this makes sense from this PoV. Just a couple of notes:

  • I would really like to avoid "data". I don't like "step data", I think "data" as a term is very vague, it can mean anything, depending on context, it's hard to speak about without ambiguity, and so on. I liked "container" because it seems less "data"-ish than dataclass and still represents what the class is used for, a container for, well, data. Fields. That's why I added various container_* helpers, so I for one would do something like from dataclasses import dataclass as container.
  • I like "field", it seems to be an acceptable term to both dataclasses and attrs, and it fits the namespace in my head. We have our custom implementation, once everything does from tmt.containers import field, do we even need to call it tmt_field/fmf_field? I'm not sure I understand the need for a distinct name here.
  • I'm not 100% sure we would be able to perform the transition as painlessly as I mentioned above, i.e. without polluting plugins with trivial import changes; given how entangled everything is internal, getting things perfect enough that we would just swap one implementation for another without plugins noticing is a very tall order - we may end up with some field_new or field_v2` temporarily, and I for one would be fine with that, as long as there is a good reason and a plan how to resolve it, in the next release, for example. We have some breathing space, although it would be fine to not regress in functionality too much, too often and for too long :)
  • I'd very much argue in favor of multiple PRs. While it's possible to put multiple nice commits into one PR, when the merge time comes, it's harder to squash what should be squashed, and preserve history. Also the review seems to work better when it's less changes in one PR, at least when a potential reviewer checks the PR for the first time - seeing 50 changed files may discourage even the bravest ones, despite the changes being put in nice & tidy commits.

@LecrisUT
Copy link
Contributor Author

LecrisUT commented Sep 9, 2024

I was thinking stub files as additive to keep things cleaner from parts like @overload, but indeed I can see it can easily get out-of-sync. There is one situation where it might be necessary, and that is for @deprecated because we cannot decorate an attribute.

Hm, I could live with the current approach, something like field(..., deprecated=...).

I am thinking of something slightly different. For field deprecation I think that's the only way we can do it currently until decorators can be attached to attributes and annotations, which I don't really know how that would work. What I was thinking of deprecating via stub files is if a plugin has:

from tmt.utils import field

which would deprecate to point the developer to tmt.dataclasses.field instead. For most cases we can create a shim, but if we want to deprecate a constant/variable, that is significantly more challenging. We can do it with the stub files though because we can simply lie about its type in there.

  • unify the fmf spec documentation with the python code (or if we are bold we could even generate/append the python docstring from the fmf files, but not sure if IDEs would render it)

WDYT about code being the source here? My idea was to make plugin code the single source of truth, and generate plugin documentation, schema, and everything from the code of the plugin. The same would apply to spec like test/plan/story fields.

👍 to that. Would work similar with the json schema generation.

  • generate json schema from the annotations

Yep, that's one of the goals :)

Here is some reference form scikit-build-core. I think we can simplify it a lot with jinja templating.

  • generally most other features could be handled by inheritance and __init_subclass__, but maybe some parts are easier done by overriding the dataclass instead, annotations stuff being one of them that I know of

But another main reason is for plugin developers to guarantee that any improvements made on the base tmt containers do not require any refactoring on their side.

I think we are aligned, at least in general, and what remains are the technicalities. For example, this move of "base container classes" into other own location, absolutely. What I'm less sure is whether we speak the same language when it comes to names, e.g.

Any thoughts on renaming dataclass to something distinct from vanilla dataclass/attrs.define? What about tmt_data/fmf_data, and maybe similarly tmt_field/fmf_field?

If you mean the original dataclasses.dataclass decorator, and you think we should use it in tmt code instead of "raw" @dataclasses.dataclass, then yes, I think that makes sense - for me, the benefit would be we could change the implementation of tmt.container.whateverdecorator without polluting diff with oneliners changing what decorator is used in all plugins and base. So this makes sense from this PoV.

Indeed we are in sync on this thought 👍.

Just a couple of notes:

* I would really like to avoid "data". I don't like "step data", I think "data" as a term is very vague, it can mean anything, depending on context, it's hard to speak about without ambiguity, and so on. I liked "container" because it seems less "data"-ish than `dataclass` and still represents what the class is used for, a container for, well, data. Fields. That's why I added various `container_*` helpers, so I for one would do something like `from dataclasses import dataclass as container`.

Hmm, makes sense. One alternative is attrs syntax of define, wdyt? I like the looks of tmt.containers.define much better than dataclass's snytax.

* I like "field", it seems to be an acceptable term to both dataclasses and attrs, and it fits the namespace in my head. We have our custom implementation, once everything does `from tmt.containers import field`, do we even need to call it `tmt_field`/`fmf_field`? I'm not sure I understand the need for a distinct name here.

This can be relaxed. It's mostly from navigating the tmt code and finding field() and it is not clear if it's dataclasses.field or tmt.utils.field. That of course can be fixed with ruff to request qualified names, and maybe we can ship some recommended ruff settings for plugin developing.

* I'm not 100% sure we would be able to perform the transition as painlessly as I mentioned above, i.e. without polluting plugins with trivial import changes; given how entangled everything is internal, getting things perfect enough that we would just swap one implementation for another without plugins noticing is a very tall order - we may end up with some `field_new` or field_v2` temporarily, and I for one would be fine with that, as long as there is a good reason and a plan how to resolve it, in the next release, for example. We have some breathing space, although it would be fine to not regress in functionality too much, too often and for too long :)

You mean w.r.t. review or with functionality? And I guess this is for the tmt code only, or maybe there are some other tmt plugins that I should keep in mind to preserve compatibility. If it's the former, then shouldn't it be resolved by splitting the PR into more minimal commits?

* I'd very much argue in favor of multiple PRs. While it's possible to put multiple nice commits into one PR, when the merge time comes, it's harder to squash what should be squashed, and preserve history. Also the review seems to work better when it's less changes in one PR, at least when a potential reviewer checks the PR for the first time - seeing 50 changed files may discourage even the bravest ones, despite the changes being put in nice & tidy commits.

👍 I can work with that. Let's finish the design discussion in this PR, and then I'll open a few separate PRs by cherry-picking form here.

@happz
Copy link
Collaborator

happz commented Sep 25, 2024

I was thinking stub files as additive to keep things cleaner from parts like @overload, but indeed I can see it can easily get out-of-sync. There is one situation where it might be necessary, and that is for @deprecated because we cannot decorate an attribute.

Hm, I could live with the current approach, something like field(..., deprecated=...).

I am thinking of something slightly different. For field deprecation I think that's the only way we can do it currently until decorators can be attached to attributes and annotations, which I don't really know how that would work. What I was thinking of deprecating via stub files is if a plugin has:

from tmt.utils import field

which would deprecate to point the developer to tmt.dataclasses.field instead. For most cases we can create a shim, but if we want to deprecate a constant/variable, that is significantly more challenging. We can do it with the stub files though because we can simply lie about its type in there.

I'm not sure we speak about the same kind of deprecation. I meant one when a "metadata key" or CLI option gets deprecated from users' PoV. Like --format:

@option(
    '--format', default=_test_export_default, 
    show_default=True,
    help='Output format.',
    deprecated=Deprecated('1.21', hint='use --how instead'),
    choices=_test_export_formats)

I'm starting tot think you meant deprecating use of dataclass.field itself. Is that correct?

  • unify the fmf spec documentation with the python code (or if we are bold we could even generate/append the python docstring from the fmf files, but not sure if IDEs would render it)

WDYT about code being the source here? My idea was to make plugin code the single source of truth, and generate plugin documentation, schema, and everything from the code of the plugin. The same would apply to spec like test/plan/story fields.

👍 to that. Would work similar with the json schema generation.

  • generate json schema from the annotations

Yep, that's one of the goals :)

Here is some reference form scikit-build-core. I think we can simplify it a lot with jinja templating.

Yep, I played a bit with Jinja and plugins while creating plugin docs, I got some general feeling, and this should be definitely doable. I would very much like to add some library that would let us reliably process annotations, manually it's a PITA to distinguish between int and list[int] with all corner cases. And I think such a library would be needed for this endeavor where we want to convert dataclass field into a JSON schema describing it... Maybe there's a piece in attrs we could use for this.

  • generally most other features could be handled by inheritance and __init_subclass__, but maybe some parts are easier done by overriding the dataclass instead, annotations stuff being one of them that I know of

But another main reason is for plugin developers to guarantee that any improvements made on the base tmt containers do not require any refactoring on their side.

I think we are aligned, at least in general, and what remains are the technicalities. For example, this move of "base container classes" into other own location, absolutely. What I'm less sure is whether we speak the same language when it comes to names, e.g.

Any thoughts on renaming dataclass to something distinct from vanilla dataclass/attrs.define? What about tmt_data/fmf_data, and maybe similarly tmt_field/fmf_field?

If you mean the original dataclasses.dataclass decorator, and you think we should use it in tmt code instead of "raw" @dataclasses.dataclass, then yes, I think that makes sense - for me, the benefit would be we could change the implementation of tmt.container.whateverdecorator without polluting diff with oneliners changing what decorator is used in all plugins and base. So this makes sense from this PoV.

Indeed we are in sync on this thought 👍.

Just a couple of notes:

* I would really like to avoid "data". I don't like "step data", I think "data" as a term is very vague, it can mean anything, depending on context, it's hard to speak about without ambiguity, and so on. I liked "container" because it seems less "data"-ish than `dataclass` and still represents what the class is used for, a container for, well, data. Fields. That's why I added various `container_*` helpers, so I for one would do something like `from dataclasses import dataclass as container`.

Hmm, makes sense. One alternative is attrs syntax of define, wdyt? I like the looks of tmt.containers.define much better than dataclass's snytax.

@container doesn't look bad either :) @define looks like a repetition to me - I am already defining a class, I know that, that's why I wrote class Foo, why do I need to @define it even more? On the other hand, @dataclass or @container seems like marking a class with its purpose, which makes more sense to me.

* I like "field", it seems to be an acceptable term to both dataclasses and attrs, and it fits the namespace in my head. We have our custom implementation, once everything does `from tmt.containers import field`, do we even need to call it `tmt_field`/`fmf_field`? I'm not sure I understand the need for a distinct name here.

This can be relaxed. It's mostly from navigating the tmt code and finding field() and it is not clear if it's dataclasses.field or tmt.utils.field. That of course can be fixed with ruff to request qualified names, and maybe we can ship some recommended ruff settings for plugin developing.

I suppose tmt.utils.field exists because dataclasses.field exists, and there is still a bit of a mess in terminology and what names are used for variables - field, key, option, sometimes field-ish name holds CLI option and so on. If we forbid dataclass.field, it should become clearer.

* I'm not 100% sure we would be able to perform the transition as painlessly as I mentioned above, i.e. without polluting plugins with trivial import changes; given how entangled everything is internal, getting things perfect enough that we would just swap one implementation for another without plugins noticing is a very tall order - we may end up with some `field_new` or field_v2` temporarily, and I for one would be fine with that, as long as there is a good reason and a plan how to resolve it, in the next release, for example. We have some breathing space, although it would be fine to not regress in functionality too much, too often and for too long :)

You mean w.r.t. review or with functionality? And I guess this is for the tmt code only, or maybe there are some other tmt plugins that I should keep in mind to preserve compatibility. If it's the former, then shouldn't it be resolved by splitting the PR into more minimal commits?

Both review and functionality. Smaller steps are easier to review, as long as there's some documented transition path that's being followed.

There are definitely out-of-tree plugins we need to care about, luckily we do run several tests on every tmt PR checking these as well, so we should get notified as soon as we break something. And often it may end up with fix of those plugins, again, if it's understood why and for how long a apparently weird change is needed, it's easier to sell and handle.

* I'd very much argue in favor of multiple PRs. While it's possible to put multiple nice commits into one PR, when the merge time comes, it's harder to squash what should be squashed, and preserve history. Also the review seems to work better when it's less changes in one PR, at least when a potential reviewer checks the PR for the first time - seeing 50 changed files may discourage even the bravest ones, despite the changes being put in nice & tidy commits.

👍 I can work with that. Let's finish the design discussion in this PR, and then I'll open a few separate PRs by cherry-picking form here.

@happz
Copy link
Collaborator

happz commented Sep 25, 2024

Adding this one to 1.38, let's see whether it can be done as a whole.

@happz happz added this to the 1.38 milestone Sep 25, 2024
@LecrisUT LecrisUT force-pushed the dataclasses branch 2 times, most recently from 620756d to c0ac303 Compare October 9, 2024 17:26
@LecrisUT
Copy link
Contributor Author

LecrisUT commented Oct 9, 2024

Attempt2. This time I've separated the usage into tmt.container and vanilla dataclaess.dataclass. One that is rather ambiguous is tmt.queue.Task, wondering if that should also be covered, maybe by a different container-like? Anyway, this one might be a bit easier to review when split into these 2 commits.

  • Currently mypy is mad at me because container = dataclasses.dataclass doesn't seem to be recognized well
  • I have only copied the implementation from tmt.utils, but I think tmt.container should be further split into smaller files
  • What happened in tmt.steps.prepare:Prepare.go().DependencyCollection, why is that dataclass definition inside a function?

@happz
Copy link
Collaborator

happz commented Oct 11, 2024

Attempt2. This time I've separated the usage into tmt.container and vanilla dataclaess.dataclass. One that is rather ambiguous is tmt.queue.Task, wondering if that should also be covered, maybe by a different container-like? Anyway, this one might be a bit easier to review when split into these 2 commits.

I'll definitely check it out before I'm done today. It looks like a good candidate for 1.38.

  • Currently mypy is mad at me because container = dataclasses.dataclass doesn't seem to be recognized well

I think mypy even has some special handling for dataclasses, the assignment might break it, yada yada, we'll end up with a hopefully temporary # type: ignore...

  • I have only copied the implementation from tmt.utils, but I think tmt.container should be further split into smaller files

Could be, although I for one would be happy by the move alone, more granularity might make things harder to follow again. It's an extreme example now, I'd like to avoid swinging to the opposite side of granularity scale :) But let's see.

  • What happened in tmt.steps.prepare:Prepare.go().DependencyCollection, why is that dataclass definition inside a function?

Nothing serious, it's a class that's used only in that method, so I made it a nested one. It can be there, it doesn't have to be there, it's up to us. It's probably the only one in tmt codebase, I suppose it could be argued that it should move out of the method.

@thrix thrix modified the milestones: 1.38, 1.39 Oct 25, 2024
@thrix thrix modified the milestones: 1.39, 1.40 Nov 5, 2024
@thrix thrix added the priority | must high priority, must be included in the next release label Nov 5, 2024
@psss
Copy link
Collaborator

psss commented Nov 26, 2024

Already marked as a must-have, proposed for now for 1.40. Let's try to rebase on the latest main and finish until next Friday, if possible.

@LecrisUT LecrisUT force-pushed the dataclasses branch 2 times, most recently from 9115326 to d6f43ee Compare November 26, 2024 12:56
Signed-off-by: Cristian Le <[email protected]>
Signed-off-by: Cristian Le <[email protected]>
Signed-off-by: Cristian Le <[email protected]>
Signed-off-by: Cristian Le <[email protected]>
@happz happz modified the milestones: 1.40, 1.41 Dec 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority | must high priority, must be included in the next release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants